Diagnosing Diseases using kNN

An application of kNN to diagnose Diabetes

Jacqueline Razo (Advisor: Dr. Cohen)

2025-04-13

Introduction

  • k-Nearest Neighbors (kNN) is an algorithm used in a variety of fields to classify or predict data.

  • It is a simple algorithm that classifies a data point based on how similar it is to existing classes of data points.

  • One benefit of this model is how simple it is to use; because it is non-parametric, it fits a wide variety of datasets.

  • One drawback of this model is its higher computational cost at prediction time, which means it does not perform as well or as fast on big data.

  • In this project we focused on the methodology and application of classification kNN models in the field of healthcare to predict diabetes.

Methods - Basics

  • The kNN algorithm is a nonparametric supervised learning method that can be used for classification or regression problems (Syriopoulos et al. 2023).

  • In classification, it classifies a data point by using the Euclidean distance formula to find the k nearest data points. Once these k neighbors have been found, the kNN assigns the new data point a category based on the category held by the majority of those neighbors.

  • Figure 1 illustrates this methodology with two distinct classes of hearts and circles.

  • Figure 1

Methods Continued

Figure 1 illustrates this methodology with two distinct classes, hearts and circles. The kNN algorithm is attempting to classify the mystery figure represented by the red square. The k parameter is set to k = 5, which means the algorithm will use the Euclidean distance formula to find the 5 nearest neighbors, illustrated by the green circle. From here the algorithm simply counts the number of neighbors from each class and assigns the majority class, which in this case is a heart.

  • Figure 1

Methods- The classification process

The classification process has three distinct steps:

  1. Distance calculation: The kNN first measures the distance between the data point it is trying to classify and all the training data points. Different distance metrics can be used, but the default and most commonly used method with the kNN is the Euclidean distance formula.

\[ d = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} \]

  2. Neighbor selection: The kNN exposes a parameter k that sets how many neighbors will be used to classify the unknown data point. Studies recommend using cross-validation or heuristic methods, such as setting k to the square root of the dataset size, to determine an optimal value.

  3. Classification decision based on majority voting: Once the k nearest neighbors are identified, the algorithm assigns the new data point the most frequent class label among its neighbors. In cases of ties, distance-weighted voting can be applied, where closer neighbors have a higher influence on the classification decision.
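The three steps above can be sketched in a few lines of NumPy. The tiny two-feature training set below is a made-up illustration, not part of the diabetes data:

```python
import numpy as np

def knn_classify(x_new, X_train, y_train, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from x_new to every training point
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote among the neighbors' labels
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1])
print(knn_classify(np.array([0.15, 0.1]), X_train, y_train, k=3))  # -> 0
```

A distance-weighted variant would replace the plain vote in step 3 with a sum of 1/distance per class.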

Assumptions

The kNN algorithm calculates the Euclidean distance between the unknown data point and the training data points because it assumes that similar data points will lie in close proximity to each other, as neighbors, and that data points with similar features belong to the same class. (Boateng et al. 2020)

Pre-processing Data

  • Handle missing values: kNNs work by calculating the distance between data points, and missing values can skew the results. We must handle missing values by either imputing them or dropping them.
  • Make all values numeric: kNNs only handle numeric values, so all categorical values must be encoded using either one-hot encoding or label encoding.
  • Normalize or standardize the features: We must normalize or standardize the features so that features on larger scales do not dominate the distance calculation. We can use the min-max scaler or the standard scaler to do this.
  • Reduce dimensionality: The kNN can struggle to compute meaningful distances when there are too many features. To solve this we can use Principal Component Analysis (PCA) to reduce the number of features while keeping most of the variance.
  • Remove correlated features: The kNN works best when there are not too many features, so we can use a correlation matrix to see which features we can drop. For example, it can be good to drop features with low variance or a pairwise correlation above 0.9, because these are redundant.
  • Fix class imbalance: Class imbalances can bias the model toward the majority class. We noticed a class imbalance in our dataset and chose to use the Synthetic Minority Over-sampling Technique (SMOTE) to handle it.
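A minimal sketch of the scaling and dimensionality-reduction steps with scikit-learn, using random stand-in data rather than the real features (SMOTE lives in the separate imbalanced-learn package, so it is only noted in a comment here):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 8))  # stand-in feature matrix

# Standardize so that every feature contributes equally to the distances
X_scaled = StandardScaler().fit_transform(X)

# Optionally reduce dimensionality while keeping 95% of the variance
X_reduced = PCA(n_components=0.95).fit_transform(X_scaled)

# Class imbalance would be handled after the train/test split, e.g. with
# imblearn.over_sampling.SMOTE().fit_resample(X_train, y_train)
print(X_scaled.mean(axis=0).round(6))  # every column is ~0 after scaling
print(X_reduced.shape)
```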

Hyperparameter Tuning

To increase the accuracy of the model, there are a few parameters we can adjust.

  1. Find the optimal k parameter: We can use grid search to find the best value of k.
  2. Change the distance metric: The kNN uses the Euclidean distance by default, but we can use the Manhattan distance, the Minkowski distance, or another metric.
  3. Weights: The kNN defaults to a “uniform” weight, giving all neighbors equal influence, but it can be set to “distance” so that closer neighbors carry more weight.
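All three of these knobs can be tuned together with scikit-learn's GridSearchCV; the synthetic dataset below just stands in for the scaled training data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-in data; in the project this would be the scaled training set
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {
    "n_neighbors": [3, 5, 8, 11],        # candidate k values
    "weights": ["uniform", "distance"],  # plain vs distance-weighted voting
    "metric": ["euclidean", "manhattan"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```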

Data Exploration and Visualization

  • We explored the CDC Diabetes Health Indicators dataset, sourced from the UC Irvine Machine Learning Repository. It is a set of data that was gathered by the Centers for Disease Control and Prevention (CDC) through the Behavioral Risk Factor Surveillance System (BRFSS), which is one of the biggest continuous health surveys in the United States.

  • Python and the ucimlrepo package were used to import the dataset directly from the UCI Machine Learning Repository, following the recommended instructions. This enabled us to easily save, prepare, and analyze the data for the current research.

Data Exploration and Visualization

  • The dataset consists of 253,680 survey responses and contains 21 feature variables and 1 binary target variable named Diabetes_binary.

Key Findings:

There are no missing values, meaning no imputation is needed.

Figure 2 shows a graph of the mean of different features in the data. It shows BMI, a continuous variable indicating body mass index, and the six ordinal variables, which include demographics such as age, income, and education as well as the self-reported health statuses GenHlth, MentHlth, and PhysHlth.

Data Exploration and Visualization Cont

Outliers

Data Exploration and Visualization Cont

Next, we will take a look at the binary features. Figure 4 shows us the balance between classes 0 and 1.

Correlation Analysis

A correlation heatmap was generated in Figure 5 to examine relationships between variables. The correlation heatmap helps identify strongly correlated features, which may lead to redundancy in the model.

Key Findings from Data Exploration and Visualizations:

Class Imbalance:

Only 13.9% of people have diabetes, which suggests an imbalance in the target variable. This may require oversampling (SMOTE) or class weighting when training models.
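A quick sanity check shows why accuracy alone is misleading at this 13.9% positive rate: a trivial model that always predicts "no diabetes" looks accurate while catching no cases. The labels below are simulated at that rate, not drawn from the survey:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
# Simulated labels with a 13.9% positive rate, mirroring the target variable
y_true = (rng.random(10000) < 0.139).astype(int)
# A useless model that always predicts "no diabetes"
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.1%}")  # ~86%
print(f"Recall:   {recall_score(y_true, y_pred, zero_division=0):.1%}")  # 0.0%
```

This is exactly the failure mode that oversampling or class weighting is meant to address.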

Modeling and Results

Data Preprocessing

  • There was no missing data, so we did not have to remove or impute any values.
  • We started cleaning the data by dropping duplicate rows.
  • We kept the ordinal variables as-is because they have a meaningful natural order that provides the kNN with meaningful distances.
  • We divided the data into training and testing sets, using test_size=0.2 so that 80% of the data trains the kNN and 20% tests it.
  • We chose to standardize the features so that BMI and age would be on the same scale as the other features.
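The split-and-standardize steps above might look like the following sketch (random stand-in data; the scaler is fit on the training split only, so no test information leaks into it):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))           # stand-in features
y = rng.integers(0, 2, size=1000)        # stand-in binary target

# 80/20 split; stratify keeps the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Fit the scaler on the training data only, then apply it to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.shape, X_test_s.shape)  # (800, 5) (200, 5)
```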

Modeling and Results Cont

Models

We chose to create three classification kNN models to illustrate the methodology.

Table 3: Model Summary

  Model Name   k value   Weights      Distance    SMOTE
  Model 1      5         'uniform'    Euclidean   No
  Model 2      8         'uniform'    Euclidean   No
  Model 3      5         'distance'   Euclidean   Yes

Modeling and Results- Evaluating and comparing the models

The table below shows the summary of the three models.

Table 1: kNN Model Performance Summary

  Model     k   Weight     SMOTE   Accuracy   F1 Score   Precision   Recall   ROC AUC
  Model 1   5   Uniform    No      83.22%     27.77%     40.66%      21.09%   0.71
  Model 2   8   Uniform    No      84.46%     19.47%     46.98%      12.28%   0.74
  Model 3   5   Distance   Yes     70.14%     37.45%     27.55%      58.44%   0.70
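The metric columns in Table 1 can be computed with sklearn.metrics. The snippet below is a sketch on a synthetic imbalanced dataset (roughly the 86/14 split of the real data), so its numbers will differ from the table:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in with an imbalance similar to the real target variable
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.86, 0.14], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# Configuration like Model 3 (k=5, distance weighting), minus SMOTE
model = KNeighborsClassifier(n_neighbors=5, weights="distance")
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
y_prob = model.predict_proba(X_te)[:, 1]  # probabilities for ROC AUC

print(f"Accuracy:  {accuracy_score(y_te, y_pred):.2%}")
print(f"F1 score:  {f1_score(y_te, y_pred):.2%}")
print(f"Precision: {precision_score(y_te, y_pred):.2%}")
print(f"Recall:    {recall_score(y_te, y_pred):.2%}")
print(f"ROC AUC:   {roc_auc_score(y_te, y_prob):.2f}")
```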

Conclusion

Model 2 has the highest accuracy at 84.46%, but this score is high mainly because the model is good at detecting the non-diabetic cases, which make up the majority of the data. It also has the highest ROC AUC score of 0.74, meaning it is the best model at separating the two classes; however, its recall is only 12.28%, so it correctly identifies just 12.28% of the actual positive cases for diabetes. Since our purpose in using the kNN is to detect diabetes, we would not want to use this model. That leaves Model 3, which has an accuracy of 70.14% and a much higher recall of 58.44%, correctly identifying a little over half of the positive diabetes cases. This shows how using the distance weighting and balancing the classes with SMOTE led to a better model.

In this project we created three kNN models trained to classify unknown data points into diabetes or non-diabetes classes using the CDC Diabetes Health Indicators dataset from the UC Irvine Machine Learning Repository. We were able to see how fine-tuning a kNN model can help detect diabetes in a healthcare setting.

References